02 - Example - Handling Duplicates and Missing Values

This notebook presents how to eliminate duplicates and solve the missing values.

By: Hugo Lopes
Learning Unit 08

Some inital imports:


In [1]:
import pandas as pd
import numpy as np
% matplotlib inline
from matplotlib import pyplot as plt

Load the dataset that will be used

Using the data from the previous unit (07-data-diagnostics)


In [2]:
data = pd.read_csv('../data/data_with_problems.csv', index_col=0)
print('Our dataset has %d columns (features) and %d rows (people).' % (data.shape[1], data.shape[0]))
data.head(15)


Our dataset has 3 columns (features) and 200 rows (people).
Out[2]:
age height gender
CFLOXRHMDR 88.0 163.0 female
FXLJSNLSOG 29.0 158.0 female
FWDIVJKGOI 42.0 159.0 female
YWEBKQWHRE 25.0 179.0 male
YPUQAPSOYJ 32.0 169.0 male
YPUQAPSOYJ 32.0 169.0 male
YPUQAPSOYJ 32.0 169.0 male
YPUQAPSOYJ 32.0 169.0 male
SSZQEGTLNK NaN 162.0 male
PRFEFXNGWN 36.0 166.0 female
IIVXDNOAIV 1.0 165.0 female
VVQYVNRAGQ 18.0 134.0 female
YVEDWPTEEB 31.0 149.0 female
LGPQPJXWPI 34.0 172.0 female
VURXKXJUTM 22.0 174.0 female

Dealing with found issues

Time to deal with the issues previously found.

1) Duplicated data

Drop the duplicated rows (which have all column values the same), check the YPUQAPSOYJ row above. Let us use the drop_duplicates to help us with that by keeping only the first of the duplicated rows.


In [3]:
mask_duplicated = data.duplicated(keep='first')
mask_duplicated.head(10)


Out[3]:
CFLOXRHMDR    False
FXLJSNLSOG    False
FWDIVJKGOI    False
YWEBKQWHRE    False
YPUQAPSOYJ    False
YPUQAPSOYJ     True
YPUQAPSOYJ     True
YPUQAPSOYJ     True
SSZQEGTLNK    False
PRFEFXNGWN    False
dtype: bool

In [4]:
data = data.drop_duplicates(keep='first')
print('Our dataset has now %d columns (features) and %d rows (people).' % (data.shape[1], data.shape[0]))
data.head(10)


Our dataset has now 3 columns (features) and 194 rows (people).
Out[4]:
age height gender
CFLOXRHMDR 88.0 163.0 female
FXLJSNLSOG 29.0 158.0 female
FWDIVJKGOI 42.0 159.0 female
YWEBKQWHRE 25.0 179.0 male
YPUQAPSOYJ 32.0 169.0 male
SSZQEGTLNK NaN 162.0 male
PRFEFXNGWN 36.0 166.0 female
IIVXDNOAIV 1.0 165.0 female
VVQYVNRAGQ 18.0 134.0 female
YVEDWPTEEB 31.0 149.0 female

You could also consider a duplicate a row with the same index and same age only by setting data.drop_duplicates(subset=['age'], keep='first'), but in our case it would lead to the same result. Note that in general it is not a recommended programming practice to use the argument 'inplace=True' (e.g., data.drop_duplicates(subset=['age'], keep='first', inplace=True)) --> may lead to unnexpected results.

2) Missing Values

This one of the major, if not the biggest, data problems that we faced with. There are several ways to deal with them, e.g.:

  • drop rows which contain missing values.
  • fill missing values with zero.
  • fill missing values with mean of the column the missing value is located.
  • (more advanced) use decision trees to predict the missing values.

In [5]:
missing_data = data.isnull()
print('Number of missing values (NaN) per column/feature:')
print(missing_data.sum())
print('And we currently have %d rows.' % data.shape[0])


Number of missing values (NaN) per column/feature:
age       9
height    4
gender    9
dtype: int64
And we currently have 194 rows.

That is not terrible to the point of fully dropping a column/feature due to the amount of missing values. Nevertheless, the action to do that would be data.drop('age', axis=1). The missing_data variable is our mask for the missing values:


In [6]:
missing_data.head(8)


Out[6]:
age height gender
CFLOXRHMDR False False False
FXLJSNLSOG False False False
FWDIVJKGOI False False False
YWEBKQWHRE False False False
YPUQAPSOYJ False False False
SSZQEGTLNK True False False
PRFEFXNGWN False False False
IIVXDNOAIV False False False

Drop rows with missing values

This can be done with dropna(), for instance:


In [7]:
data_aux = data.dropna(how='any')
print('Dataset now with %d columns (features) and %d rows (people).' % (data_aux.shape[1], data_aux.shape[0]))


Dataset now with 3 columns (features) and 173 rows (people).

Fill missing values with a specific value (e.g., 0)

This can be done with fillna(), for instance:


In [8]:
data_aux = data.fillna(value=0)
print('Dataset has %d columns (features) and %d rows (people).' % (data_aux.shape[1], data_aux.shape[0]))


Dataset has 3 columns (features) and 194 rows (people).

So, what happened with our dataset? Let's take a look where we had missing values before:


In [9]:
data_aux[missing_data['age']]


Out[9]:
age height gender
SSZQEGTLNK 0.0 162.0 male
TJQPFEFVVH 0.0 182.0 0
PYHWLDVICX 0.0 181.0 female
MLRPKGKACD 0.0 185.0 male
SGMGUJEBNC 0.0 173.0 MALE
YZDOYNOXAF 0.0 144.0 female
UAOAMGUQSX 0.0 144.0 male
JFVZOEGUUA 0.0 208.0 female
VYAQBLJKXJ 0.0 165.0 male

In [10]:
data_aux[missing_data['height']]


Out[10]:
age height gender
CWCFROPRFE 22.0 0.0 male
EORSIPDIHA 21.0 0.0 MALE
NGJOHICWSY 41.0 0.0 male
LNLAPFIJEQ 37.0 0.0 male

In [11]:
data_aux[missing_data['gender']]


Out[11]:
age height gender
TJQPFEFVVH 0.0 182.0 0
QXUGUHCOPT 101.0 196.0 0
LKEHZFGGTS 49.0 177.0 0
EBTRPEDHJS 43.0 147.0 0
BDFQWIHWCH 27.0 167.0 0
NUCCGRJLXN 20.0 159.0 0
GQSNBZIGBL 27.0 197.0 0
KWJJMPVSCP 24.0 189.0 0
LMZUTCGFYT 21.0 153.0 0

Looks like what we did was not the most appropriate. For instance, we create a new category in the gender column:


In [12]:
data_aux['gender'].value_counts()


Out[12]:
female    111
male       65
MALE        9
0           9
Name: gender, dtype: int64

Fill missing values with mean/mode/median:

This is one of the most common approaches.

height - filling missing values with the mean


In [13]:
data['height'] = data['height'].replace(np.nan, data['height'].mean())
data[missing_data['height']]


Out[13]:
age height gender
CWCFROPRFE 22.0 168.847368 male
EORSIPDIHA 21.0 168.847368 MALE
NGJOHICWSY 41.0 168.847368 male
LNLAPFIJEQ 37.0 168.847368 male

age - filling missing values with median


In [14]:
data.loc[missing_data['age'], 'age'] = data['age'].median()
data[missing_data['age']]


Out[14]:
age height gender
SSZQEGTLNK 27.0 162.0 male
TJQPFEFVVH 27.0 182.0 NaN
PYHWLDVICX 27.0 181.0 female
MLRPKGKACD 27.0 185.0 male
SGMGUJEBNC 27.0 173.0 MALE
YZDOYNOXAF 27.0 144.0 female
UAOAMGUQSX 27.0 144.0 male
JFVZOEGUUA 27.0 208.0 female
VYAQBLJKXJ 27.0 165.0 male

gender - filling missing values with mode

Remember we had a small problem with the data of this feature (the MALE word instead of male)? Typing problems are very common and they can be hidden problems. That's why it is so important to take a look at the data.


In [15]:
data['gender'].value_counts(dropna=False)


Out[15]:
female    111
male       65
MALE        9
NaN         9
Name: gender, dtype: int64

Let's replace MALE by male to harmonize our feature.


In [16]:
mask = data['gender'] == 'MALE'
data.loc[mask, 'gender'] = 'male'
# validate we don't have MALE:
data['gender'].value_counts(dropna=False)


Out[16]:
female    111
male       74
NaN         9
Name: gender, dtype: int64

Now we don't have the MALE entry anymore. Let us fill the missing values with the mode:


In [17]:
the_mode = data['gender'].mode()
# note that mode() return a dataframe
the_mode


Out[17]:
0    female
dtype: object

In [18]:
data['gender'] = data['gender'].replace(np.nan, data['gender'].mode()[0])
data[missing_data['gender']]


Out[18]:
age height gender
TJQPFEFVVH 27.0 182.0 female
QXUGUHCOPT 101.0 196.0 female
LKEHZFGGTS 49.0 177.0 female
EBTRPEDHJS 43.0 147.0 female
BDFQWIHWCH 27.0 167.0 female
NUCCGRJLXN 20.0 159.0 female
GQSNBZIGBL 27.0 197.0 female
KWJJMPVSCP 24.0 189.0 female
LMZUTCGFYT 21.0 153.0 female

Final check

Always a good idea...


In [19]:
data.isnull().sum()


Out[19]:
age       0
height    0
gender    0
dtype: int64